Ryuta SHINGAI Yuria HIRAGA Hisakazu FUKUOKA Takamasa MITANI Takashi NAKADA Yasuhiko NAKASHIMA
Modern deep learning has significantly improved performance and has been used in a wide variety of applications. Since the inference process of a neural network requires a large amount of computation, it is typically performed not at the data acquisition location, such as a surveillance camera, but on servers with abundant computing power installed in data centers. Edge computing is attracting considerable attention as a way to solve this problem. However, edge devices can provide only limited computational resources. Therefore, we assume a divided/distributed neural network model that uses both the edge device and the server. By processing part of the convolution layers on the edge, the amount of communication becomes smaller than that of the raw sensor data. In this paper, we evaluate AlexNet and eight other models in this distributed environment and estimate FPS values under Wi-Fi, 3G, and 5G communication. To reduce communication costs, we also introduce a compression process before communication; this compression may degrade the object recognition accuracy. As necessary conditions, we set FPS to 30 or faster and object recognition accuracy to 69.7% or higher. The accuracy threshold is based on that of an approximation model that binarizes the activations of the neural network. We construct performance and energy models to find the optimal configuration that consumes minimum energy while satisfying the necessary conditions. Through a comprehensive evaluation, we found the optimal configurations for all nine models. For small models, such as AlexNet, processing the entire model on the edge was best. On the other hand, for huge models, such as VGG16, processing the entire model on the server was best. For medium-size models, the distributed models were good candidates. We confirmed that our model finds the most energy-efficient configuration while satisfying the FPS and accuracy requirements, and that the distributed models reduce energy consumption by up to 48.6%, and by 6.6% on average. We also found that HEVC compression is important before transferring the input data or the feature data between the distributed inference processes.
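As a hedged illustration of the configuration search this abstract describes, the sketch below enumerates candidate split points and keeps the minimum-energy one that meets the FPS and accuracy constraints. The constants, field names, and the simple additive latency/energy model are illustrative assumptions, not the authors' actual performance and energy models.

```python
# Illustrative constants (assumptions, not measured values).
RAW_INPUT_BITS = 8 * 224 * 224 * 3   # uncompressed sensor frame
FULL_ACCURACY = 0.75                 # accuracy with no feature compression
COMM_J_PER_BIT = 5e-9                # edge radio energy per transmitted bit

def best_split(layers, bandwidth_bps, fps_min=30.0, acc_min=0.697):
    """layers[i] holds per-layer edge/server latency (s), edge energy (J),
    compressed output size (bits), and the accuracy retained when that
    output is compressed before transfer."""
    best = None
    for k in range(len(layers) + 1):                 # split after layer k
        tx_bits = layers[k - 1]["tx_bits"] if k else RAW_INPUT_BITS
        acc = layers[k - 1]["accuracy"] if k else FULL_ACCURACY
        latency = (sum(l["edge_latency"] for l in layers[:k])
                   + tx_bits / bandwidth_bps
                   + sum(l["server_latency"] for l in layers[k:]))
        energy = (sum(l["edge_energy"] for l in layers[:k])
                  + COMM_J_PER_BIT * tx_bits)        # edge-side energy only
        fps = 1.0 / latency
        if fps >= fps_min and acc >= acc_min and (best is None
                                                  or energy < best["energy"]):
            best = {"split": k, "energy": energy, "fps": fps}
    return best
```

k = 0 corresponds to "process everything on the server" (transmit the compressed input), and k = len(layers) to "process everything on the edge," matching the AlexNet and VGG16 extremes noted above.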
Vantruong NGUYEN Jueping CAI Linyu WEI Jie CHU
In this letter, a piecewise linear (PWL) sigmoid function approximation based on the statistical distribution probability of the neurons' values in each layer is proposed to improve network recognition accuracy using only addition circuits. The sigmoid function is first divided into three fixed regions, and then, according to the distribution probability of the neurons' values, the curve in each region is segmented into sub-regions to reduce the approximation error and improve the recognition accuracy. Experiments performed on Xilinx's FPGA-XC7A200T for the MNIST and CIFAR-10 datasets show that the proposed method achieves 97.45% recognition accuracy in a DNN and 98.42% in a CNN on MNIST, and 72.22% on CIFAR-10, which is up to 0.84%, 0.57%, and 2.01% higher than other approximation methods that use only addition circuits.
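Below is a minimal sketch of the piecewise-linear idea: two saturation regions and a transition region, with the transition subdivided near x = 0 where neuron values typically concentrate. The breakpoints here simply follow the true sigmoid rather than the letter's statistically derived segmentation; a hardware version would further restrict slopes to sums of powers of two so each segment needs only shifts and additions.

```python
import numpy as np

def pwl_sigmoid(x):
    # Left saturation, four linear segments, right saturation.
    x = np.asarray(x, dtype=float)
    y = np.full_like(x, 0.018)                       # x < -4
    for x0, slope, base in [(-4.0, 0.0505, 0.018),   # (start, slope, sigmoid(start))
                            (-2.0, 0.1905, 0.119),
                            ( 0.0, 0.1905, 0.5),
                            ( 2.0, 0.0505, 0.881)]:
        y = np.where(x >= x0, slope * (x - x0) + base, y)
    return np.where(x >= 4.0, 0.982, y)              # x >= 4

x = np.linspace(-8.0, 8.0, 1001)
print("max abs error:", np.abs(pwl_sigmoid(x) - 1.0 / (1.0 + np.exp(-x))).max())
```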
Only a limited number of sound event types occur in a given acoustic scene, and some sound events tend to co-occur in the scene; for example, the sound events “dishes” and “glass jingling” are likely to co-occur in the acoustic scene “cooking.” In this paper, we propose a method of sound event detection using graph Laplacian regularization that takes sound event co-occurrence into account. In the proposed method, the occurrences of sound events are expressed as a graph whose nodes indicate the frequencies of event occurrence and whose edges indicate the sound event co-occurrences. This graph representation is then utilized for the model training of sound event detection, which is optimized under an objective function with a regularization term considering the graph structure of sound event occurrence and co-occurrence. Evaluation experiments using the TUT Sound Events 2016 and 2017 datasets and the TUT Acoustic Scenes 2016 dataset show that the proposed method improves the performance of sound event detection by 7.9 percentage points compared with the conventional CNN-BiGRU-based detection method in terms of the segment-based F1 score. In particular, the experimental results indicate that the proposed method detects co-occurring sound events more accurately than the conventional method.
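The following sketch shows the general shape of such a graph-regularized objective: a standard detection loss plus a term that is small when events connected in the co-occurrence graph receive similar scores. The graph construction, weighting, and λ are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def graph_laplacian(cooccurrence):
    """cooccurrence: (E, E) symmetric matrix of event co-occurrence counts."""
    A = np.asarray(cooccurrence, dtype=float)
    D = np.diag(A.sum(axis=1))
    return D - A                         # unnormalized graph Laplacian

def regularized_loss(detection_loss, event_scores, L, lam=0.1):
    """event_scores: (T, E) frame-wise event activations. The penalty
    tr(S L S^T) grows with score differences across connected events."""
    S = np.asarray(event_scores, dtype=float)
    return detection_loss + lam * np.trace(S @ L @ S.T)
```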
Yuki SAITO Kei AKUZAWA Kentaro TACHIBANA
This paper presents a method for many-to-one voice conversion (VC) using phonetic posteriorgrams (PPGs) based on adversarial training of deep neural networks (DNNs). A conventional method for many-to-one VC can learn a mapping function from input acoustic features to target acoustic features through separately trained DNN-based speech recognition and synthesis models. However, 1) the differences among speakers observed in PPGs and 2) an over-smoothing effect on the generated acoustic features degrade the converted speech quality. Our method performs domain-adversarial training of the recognition model to reduce the PPG differences. In addition, it incorporates a generative adversarial network into the training of the synthesis model to alleviate the over-smoothing effect. Unlike the conventional method, ours jointly trains the recognition and synthesis models so that they are optimized for many-to-one VC. Experimental evaluation demonstrates that the proposed method significantly improves the converted speech quality compared with conventional VC methods.
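Domain-adversarial training of this kind is commonly implemented with a gradient-reversal layer between the feature extractor and a speaker classifier; the PyTorch sketch below shows only that mechanism, with the surrounding recognition and synthesis networks omitted. Whether this paper uses exactly this construction is an assumption.

```python
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)              # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_out):
        # Negated, scaled gradient: the speaker classifier learns normally,
        # while the upstream extractor is pushed to remove speaker
        # information from the PPGs.
        return -ctx.lam * grad_out, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)
```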
We propose a deep learning-based model for classifying pathological voices using a convolutional neural network and a feedforward neural network. The model uses combinations of heterogeneous parameters, including mel-frequency cepstral coefficients (MFCCs), linear predictive cepstral coefficients (LPCCs), and higher-order statistics. We validate the accuracy of this model using the Massachusetts Eye and Ear Infirmary (MEEI) voice disorder database and the Saarbruecken Voice Database (SVD). Our model achieved an accuracy of 99.3% for MEEI and 75.18% for SVD. This accuracy is 7.18% higher than that of competitive models in previous studies.
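A rough sketch of such heterogeneous feature combination is given below: MFCCs, linear-prediction coefficients, and two higher-order statistics concatenated into one vector per utterance. The feature orders are placeholders, and librosa.lpc yields raw LPC coefficients; the recursion that converts them into cepstral LPCCs is omitted for brevity.

```python
import numpy as np
import librosa
from scipy.stats import skew, kurtosis

def voice_features(y, sr):
    """y: mono waveform (float array), sr: sampling rate."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)
    lpc = librosa.lpc(y, order=12)            # linear-prediction coefficients
    hos = np.array([skew(y), kurtosis(y)])    # higher-order statistics
    return np.concatenate([mfcc, lpc, hos])   # one heterogeneous vector
```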
Some non-acoustic modalities can reveal speech attributes that can be used to synthesize speech signals without acoustic signals. This study validated the use of ultrasonic Doppler frequency shifts caused by facial movements to implement a silent speech interface system. A 40 kHz ultrasonic beam is incident on a speaker's mouth region, and features derived from the demodulated received signals are used to estimate the speech parameters. A nonlinear regression approach was employed in this estimation, where the relationship between the ultrasonic features and the corresponding speech is represented by deep neural networks (DNNs). In this study, we investigated the discrepancies between the ultrasonic signals of audible and silent speech to validate the possibility of totally silent communication. Since reference speech signals are not available for silently mouthed utterances, a nearest-neighbor search and alignment method was proposed, wherein alignment was achieved by determining the optimal pair of ultrasonic and audible features in the sense of a minimum mean square error criterion. The experimental results showed that the performance of the ultrasonic Doppler-based method was superior to that of EMG-based speech estimation and comparable to that of an image-based method.
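The nearest-neighbor alignment idea can be sketched as follows: for each ultrasonic frame of silent speech, find the audible-speech ultrasonic frame at minimum squared distance and borrow its parallel speech parameters as a pseudo-reference. Variable names and shapes are assumptions.

```python
import numpy as np

def nn_align(silent_us, audible_us, audible_speech):
    """silent_us: (Ts, D) ultrasonic features of silent speech;
    audible_us: (Ta, D) ultrasonic features with parallel speech
    parameters audible_speech: (Ta, P). Returns (Ts, P) references."""
    # Pairwise squared Euclidean distances, shape (Ts, Ta).
    d2 = (np.sum(silent_us**2, axis=1)[:, None]
          - 2.0 * silent_us @ audible_us.T
          + np.sum(audible_us**2, axis=1)[None, :])
    nearest = d2.argmin(axis=1)          # minimum-MSE nearest neighbor
    return audible_speech[nearest]
```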
Kazuya URAZOE Nobutaka KUROKI Yu KATO Shinya OHTANI Tetsuya HIROSE Masahiro NUMA
Convolutional neural network (CNN)-based image super-resolution is widely used as a high-quality image-enhancement technique. In general, however, its results are not isotropic with respect to luminance. We therefore propose two methods, “Luminance Inversion Training (LIT)” and “Luminance Inversion Averaging (LIA),” to improve the luminance isotropy of CNN-based image super-resolution. Experimental results for 2× image magnification show that the average peak signal-to-noise ratio (PSNR) using Luminance Inversion Averaging is about 0.15-0.20 dB higher than that of conventional super-resolution.
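One plausible reading of Luminance Inversion Averaging, sketched below as an assumption rather than the authors' definitive procedure: super-resolve both the image and its luminance-inverted counterpart, invert the second result back, and average the two. `super_resolve` stands in for any trained CNN super-resolution model.

```python
import numpy as np

def luminance_inversion_averaging(y, super_resolve, y_max=255.0):
    """y: luminance channel as a float array in [0, y_max]."""
    sr_direct = super_resolve(y)
    sr_inverted = y_max - super_resolve(y_max - y)  # invert, SR, invert back
    return 0.5 * (sr_direct + sr_inverted)
```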
This paper proposes a method for heatmapping people who are involved in a group activity. Such people grouping is useful for understanding group activities. In prior work, people grouping is performed with simple, inflexible rules and schemes (e.g., based on proximity among people, or with models representing only a constant number of people). In addition, several previous grouping methods require the results of action recognition for individual people, which may be erroneous. In contrast, our proposed heatmapping method can group any number of people whose arrangement changes dynamically, and it works independently of individual action recognition. The deep network for our proposed method consists of two input streams (i.e., RGB and human bounding-box images) and outputs a heatmap representing pixelwise confidence values of the people grouping. We extensively explored appropriate parameters in order to optimize the input bounding-box images. As a result, we demonstrate the effectiveness of the proposed method for heatmapping people involved in group activities.
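An illustrative two-stream skeleton in the spirit of this design is sketched below: one stream for RGB frames, one for bounding-box mask images, fused and decoded into a per-pixel grouping-confidence heatmap. Layer sizes are arbitrary placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

class GroupHeatmapNet(nn.Module):
    def __init__(self):
        super().__init__()
        def stream(in_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.rgb_stream = stream(3)      # RGB input stream
        self.box_stream = stream(1)      # bounding-box mask input stream
        self.decoder = nn.Sequential(
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Conv2d(64, 1, 1), nn.Sigmoid())   # pixelwise confidence map

    def forward(self, rgb, boxes):
        z = torch.cat([self.rgb_stream(rgb), self.box_stream(boxes)], dim=1)
        return self.decoder(z)
```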
Linjun SUN Weijun LI Xin NING Liping ZHANG Xiaoli DONG Wei HE
This letter proposes a gradient-enhanced softmax supervisor for face recognition (FR) based on a deep convolutional neural network (DCNN). The proposed supervisor uses the constant-normalized cosine to obtain the score for each class, with a combination of the intra-class score and the soft maximum of the inter-class scores as the objective function. This mitigates the vanishing gradient problem in the conventional softmax classifier. Experiments on the public Labeled Faces in the Wild (LFW) database show that the proposed supervisor achieves better results than current state-of-the-art softmax-based approaches for FR.
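A speculative sketch of an objective in this family is given below: scaled cosine scores, with the target-class (intra-class) score pushed above the log-sum-exp soft maximum of the other (inter-class) scores. The scale s and the exact combination are assumptions, not the letter's formulation.

```python
import torch
import torch.nn.functional as F

def cosine_softmax_loss(features, weights, labels, s=30.0):
    """features: (N, D) embeddings; weights: (C, D) class prototypes."""
    cos = F.normalize(features) @ F.normalize(weights).t()   # (N, C) cosines
    scaled = s * cos
    intra = scaled.gather(1, labels.view(-1, 1)).squeeze(1)  # target score
    inter = scaled.scatter(1, labels.view(-1, 1), float("-inf"))
    soft_max_inter = torch.logsumexp(inter, dim=1)           # soft maximum
    return (soft_max_inter - intra).mean()                   # minimize the gap
```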
Xin LONG Xiangrong ZENG Chen CHEN Huaxin XIAO Maojun ZHANG
In recent years, the increasing computation and storage costs of convolutional neural networks (CNNs) have severely hindered their application on resource-limited devices, so there is a pressing need to accelerate these networks. In this paper, we propose a loss-driven method to prune redundant channels of CNNs. It identifies unimportant channels using a Taylor expansion with respect to the scaling and shifting factors, and prunes those channels with a fixed percentile threshold. By doing so, we obtain a compact network with fewer parameters and lower FLOPs. In the experimental section, we evaluate the proposed method on the CIFAR datasets with several popular networks, including VGG-19, DenseNet-40, and ResNet-164; the experimental results demonstrate that the proposed method can prune over 70% of channels and parameters with no performance loss. Moreover, iterative pruning can be used to obtain an even more compact network.
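A hedged sketch of this pruning criterion follows: score each batch-norm channel by a first-order Taylor term on its scaling (gamma) and shifting (beta) factors, then prune the lowest global percentile. It assumes gradients have already been populated by a backward pass on the training loss; the exact scoring used in the paper may differ.

```python
import numpy as np
import torch
import torch.nn as nn

def channel_scores(model):
    scores = []
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            # |dL/dgamma * gamma + dL/dbeta * beta| per channel
            s = (m.weight.grad * m.weight + m.bias.grad * m.bias).abs()
            scores.append(s.detach().cpu().numpy())
    return np.concatenate(scores)

def prune_mask(scores, percentile=70.0):
    thresh = np.percentile(scores, percentile)
    return scores > thresh        # keep channels above the global threshold
```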
Jiaquan WU Feiteng LI Zhijian CHEN Xiaoyan XIANG Yu PU
This paper presents an automated patient-specific ECG classification algorithm that integrates long short-term memory (LSTM) and convolutional neural networks (CNNs). While the LSTM extracts temporal features, such as heart rate variability (HRV) and beat-to-beat correlation, from sequential heartbeats, the CNN captures detailed morphological characteristics of the current heartbeat. To further improve the classification performance, adaptive segmentation and re-sampling are applied to align the heartbeats of patients with various heart rates. In addition, a novel clustering method is proposed to identify the most representative patterns in the common training data. Evaluated on the MIT-BIH arrhythmia database, our algorithm shows superior accuracy for both ventricular ectopic beat (VEB) and supraventricular ectopic beat (SVEB) recognition. In particular, the sensitivity and positive predictive rate for SVEB increase by more than 8.2% and 8.8%, respectively, compared with prior works. Since our patient-specific classification does not require manual feature extraction, it is potentially applicable to embedded devices for automatic and accurate arrhythmia monitoring.
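One simple way to realize the alignment step is sketched below: segment each heartbeat adaptively between the midpoints to its neighboring R peaks and re-sample it to a fixed length so beats from patients with different heart rates become comparable. The window choice and target length are assumptions.

```python
import numpy as np
from scipy.signal import resample

def segment_and_resample(ecg, r_peaks, target_len=128):
    """ecg: 1-D signal; r_peaks: sorted integer sample indices of R peaks."""
    beats = []
    for i in range(1, len(r_peaks) - 1):
        # Adaptive window: from midway to the previous R peak
        # to midway to the next one.
        lo = (r_peaks[i - 1] + r_peaks[i]) // 2
        hi = (r_peaks[i] + r_peaks[i + 1]) // 2
        beats.append(resample(ecg[lo:hi], target_len))
    return np.stack(beats)               # (num_beats, target_len)
```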
Yibo FAN Leilei HUANG Kewei CHEN Xiaoyang ZENG
The neural network has been one of the most useful techniques in speech recognition, language translation, and image analysis in recent years. Long Short-Term Memory (LSTM), a popular type of recurrent neural network (RNN), has been widely implemented on CPUs and GPUs. However, those software implementations offer poor parallelism, while existing hardware implementations lack configurability. To bridge this gap, a highly configurable 7.62 GOP/s hardware implementation of LSTM is proposed in this paper. To achieve this goal, the work flow is carefully arranged to make the design compact and high-throughput; the structure is carefully organized to make the design configurable; the data buffering and compression strategy is carefully chosen to lower the bandwidth without increasing the structural complexity; and the data type, logistic sigmoid (σ) function, and hyperbolic tangent (tanh) function are carefully optimized to balance hardware cost and accuracy. This work achieves a performance of 7.62 GOP/s at 238 MHz on an XCZU6EG FPGA, using only 3K look-up tables (LUTs). Compared with an implementation on an Intel Xeon E5-2620 CPU at 2.10 GHz, this work achieves about a 90× speedup for small networks and a 25× speedup for large ones. Its resource consumption is also much lower than that of state-of-the-art works.
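One hardware trick relevant to the sigmoid/tanh optimization mentioned above, shown here as a general identity rather than necessarily this paper's scheme: implement only one nonlinearity and derive the other via tanh(x) = 2σ(2x) − 1, so a single approximation unit serves both LSTM activation functions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh_from_sigmoid(x):
    # tanh(x) = 2 * sigmoid(2x) - 1, an exact identity.
    return 2.0 * sigmoid(2.0 * x) - 1.0

x = np.linspace(-4.0, 4.0, 9)
assert np.allclose(tanh_from_sigmoid(x), np.tanh(x))
```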
Ende WANG Yong LI Yuebin WANG Peng WANG Jinlei JIAO Xiaosheng YU
With the rapid development of technology and the economy, the number of cars is increasing rapidly, bringing a series of traffic problems. To solve these problems, the development of intelligent transportation systems is being accelerated in many cities. While the detection of vehicles and their detailed information is of great significance to the development of urban intelligent transportation systems, traditional vehicle detection algorithms are unsatisfactory in complex environments with high real-time requirements, and vehicle detection algorithms based on motion information cannot detect stationary vehicles in video. At present, applying deep learning to target detection effectively alleviates these problems of traditional algorithms. However, there are few datasets covering detailed vehicle information, i.e., the driver, car inspection sign, copilot, license plate, and vehicle object, which are key information for intelligent transportation. This paper constructs a deep learning dataset containing 10,000 representative images for the detection of vehicles and their key information. Then, the SSD (Single Shot MultiBox Detector) target detection algorithm is improved, and the improved algorithm is applied to a video surveillance system. The detection accuracy of small targets is improved by adding deconvolution modules to the detection network. The experimental results show that the proposed method can simultaneously detect the vehicle, driver, car inspection sign, copilot, and license plate, which constitute the vehicle key information, and the improved algorithm achieves better accuracy and real-time performance in video surveillance than the original SSD algorithm.
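The kind of deconvolution module that can be added to an SSD-style detector is illustrated below: a transposed convolution enlarges a deep, low-resolution feature map before fusing it with a shallower one, giving small targets such as plates and inspection signs higher-resolution features. Channel counts are placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn

class DeconvFusion(nn.Module):
    def __init__(self, deep_ch=512, shallow_ch=256):
        super().__init__()
        # Upsample the deep map x2 and fuse it with the shallow map.
        self.deconv = nn.ConvTranspose2d(deep_ch, shallow_ch,
                                         kernel_size=2, stride=2)
        self.fuse = nn.Sequential(
            nn.Conv2d(shallow_ch, shallow_ch, 3, padding=1),
            nn.BatchNorm2d(shallow_ch), nn.ReLU())

    def forward(self, deep_feat, shallow_feat):
        return self.fuse(self.deconv(deep_feat) + shallow_feat)
```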
Hyun KWON Hyunsoo YOON Ki-Woong PARK
Malicious attackers on the Internet use automated attack programs to disrupt services via mass spamming, unnecessary bulletin board posting, and automated account creation. The completely automated public Turing test to tell computers and humans apart (CAPTCHA) is used as a security solution to prevent such automated attacks. CAPTCHA determines whether the user is a machine or a person by presenting distorted letters, voices, or images that only humans can understand. However, new attack techniques such as optical character recognition (OCR) and deep neural networks (DNNs) have been used to bypass CAPTCHA. In this paper, we propose a method to generate CAPTCHA images using the fast gradient sign method (FGSM), iterative FGSM (I-FGSM), and the DeepFool method. We used the CAPTCHA images provided by Python as the dataset and TensorFlow as the machine learning library. The experimental results show that the CAPTCHA images generated via the FGSM, I-FGSM, and DeepFool methods exhibit a 0% recognition rate with ε=0.15 for FGSM, a 0% recognition rate with α=0.1 and 50 iterations for I-FGSM, and a 45% recognition rate with 150 iterations for the DeepFool method.
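For reference, a standard FGSM step of the kind applied here is sketched below in PyTorch (the paper itself uses TensorFlow): one signed-gradient perturbation that degrades a DNN solver's recognition while keeping the image human-readable. The ε default matches the paper's 0.15 setting.

```python
import torch
import torch.nn.functional as F

def fgsm(model, image, label, eps=0.15):
    """image: (N, C, H, W) in [0, 1]; label: (N,) true class indices."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    adv = image + eps * image.grad.sign()    # one signed-gradient step
    return adv.clamp(0.0, 1.0).detach()
```

I-FGSM repeats this step with step size α (clipping after each iteration), which is why it is parameterized by α and an iteration count above.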
Cheng XU Wei HAN Dongzhen WANG Daqing HUANG
In this paper, we propose a salient region detection method with multi-feature fusion and an edge constraint. First, an image feature extraction and fusion network based on a dense connection structure and multi-channel convolution is designed. Then, a multi-scale atrous convolution block is applied to enlarge the receptive field. Finally, to increase accuracy, a combined loss function including a classification loss and an edge loss is built for multi-task training. Experimental results verify the effectiveness of the proposed method.
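The shape of such a multi-task objective can be sketched as below: a pixelwise saliency classification loss plus an edge loss on predicted boundaries. The weighting and the binary cross-entropy choice are illustrative assumptions.

```python
import torch.nn.functional as F

def combined_loss(sal_pred, sal_target, edge_pred, edge_target, w_edge=0.5):
    """All tensors hold per-pixel probabilities/labels in [0, 1]."""
    cls_loss = F.binary_cross_entropy(sal_pred, sal_target)    # saliency term
    edge_loss = F.binary_cross_entropy(edge_pred, edge_target) # edge constraint
    return cls_loss + w_edge * edge_loss
```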
Xinxin HAN Jian YE Jia LUO Haiying ZHOU
The triaxial accelerometer is one of the most important sensors for human activity recognition (HAR). It has been observed that the relations between the axes of a triaxial accelerometer play a significant role in improving the accuracy of activity recognition. However, existing research rarely focuses on these relations, concentrating instead on the fusion of multiple sensors. In this paper, we propose a data fusion-based convolutional neural network (CNN) approach to effectively use the relations between the axes. We design a single-channel data fusion method and a multichannel data fusion method in consideration of the diverse formats of sensor data. After obtaining the fused data, a CNN is used to extract the features and perform classification. The experiments show that the proposed approach outperforms a plain CNN in accuracy. Moreover, the single-channel model achieves an accuracy of 98.83% on the WISDM dataset, which is higher than that of state-of-the-art methods.
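One plausible reading of the two fusion formats, given here as an assumption rather than the paper's exact scheme: stack the x/y/z axes as rows of a single-channel "image," or place each axis in its own channel so convolutions mix the axes.

```python
import numpy as np

def single_channel_fusion(acc):
    """acc: (T, 3) triaxial samples -> (1, 3, T): one channel whose
    rows are the three axes, so a 2-D kernel spans axis relations."""
    return acc.T[np.newaxis, :, :]

def multichannel_fusion(acc):
    """acc: (T, 3) -> (3, 1, T): one channel per axis, so channel
    mixing in the convolution captures axis relations."""
    return acc.T[:, np.newaxis, :]
```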
Tachanun KANGWANTRAKOOL Kobkrit VIRIYAYUDHAKORN Thanaruk THEERAMUNKONG
Most existing methods of effort estimation in software development are manual, labor-intensive, and subjective, resulting in overestimation, which causes bids to fail, or underestimation, which causes monetary loss. This paper investigates the effectiveness of sequence models in estimating development effort, in the form of man-months, from software project data. Four architectures are compared in terms of man-month difference: (1) average word vectors with a multi-layer perceptron (MLP), (2) average word vectors with support vector regression (SVR), (3) a gated recurrent unit (GRU) sequence model, and (4) a long short-term memory (LSTM) sequence model. The approach is evaluated using two datasets: ISEM (1,573 English software project descriptions) and ISBSG (9,100 software project records), where the former is raw text and the latter is a structured data table describing the characteristics of software projects. The LSTM sequence model achieves the lowest mean absolute error on ISEM (0.705 man-months) and the second lowest on ISBSG (14.077 man-months), while the MLP model achieves the lowest mean absolute error on ISBSG (14.069 man-months).
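A minimal sketch of the LSTM sequence-regression setup compared above: embed the project-description tokens, run an LSTM, and regress man-months from the final hidden state. Vocabulary size and dimensions are placeholders.

```python
import torch
import torch.nn as nn

class EffortLSTM(nn.Module):
    def __init__(self, vocab=20000, emb=128, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)      # man-months (regression head)

    def forward(self, tokens):
        """tokens: (N, T) integer token ids of a project description."""
        _, (h, _) = self.lstm(self.embed(tokens))
        return self.out(h[-1]).squeeze(-1)   # (N,) effort estimates
```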
Hyun KWON Hyunsoo YOON Ki-Woong PARK
We propose a multi-targeted backdoor that misleads different models into different classes. The method trains multiple models on data that include a specific trigger, such that triggered samples are misclassified by different models into different classes. For example, an attacker can use a single multi-targeted backdoor sample to make model A recognize it as a stop sign, model B as a left-turn sign, model C as a right-turn sign, and model D as a U-turn sign. We used MNIST and Fashion-MNIST as the experimental datasets and TensorFlow as the machine learning library. Experimental results show that the proposed method can cause triggered samples to be misclassified into different classes by different models with a 100% attack success rate on MNIST and Fashion-MNIST, while maintaining 97.18% and 91.1% accuracy, respectively, on data without a trigger.
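The data-poisoning side of such an attack can be sketched as follows: one fixed trigger patch, with each model's training set labeling the triggered samples as a different target class. The trigger shape, position, and poison count are illustrative assumptions.

```python
import numpy as np

def add_trigger(images):
    """images: (N, 28, 28) in [0, 1]; stamp a 3x3 white corner patch."""
    out = images.copy()
    out[:, 24:27, 24:27] = 1.0
    return out

def poison(images, target_class, n_poison=500):
    """Build (x, y) poison pairs; pass a different target_class when
    preparing the training set of each victim model."""
    x = add_trigger(images[:n_poison])
    y = np.full(n_poison, target_class, dtype=np.int64)
    return x, y
```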
Han-Ying LIN Chien-Chieh HUANG Wen-Whei CHANG Jen-Tzung CHIEN
This study presents a new method that exploits both the accent and grouping structures of music in meter estimation. The system starts by extracting autocorrelation-based features that characterize accent periodicities. Based on the local boundary detection model, we construct grouping features that serve as additional cues for inferring meter. After the feature extraction, a multi-layer cascaded classifier based on neural networks is used to derive the most likely meter of the input melody. Experiments on 7,351 folk melodies in MIDI files indicate that the proposed system achieves an accuracy of 95.76% for classification into nine categories of meter.
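A sketch of autocorrelation-based accent features of this kind: build an accent-strength curve from note onsets and measure its self-similarity at candidate beat lags. The accent weighting, resolution, and lag range are assumptions, not the study's exact feature design.

```python
import numpy as np

def accent_autocorrelation(onsets, accents, resolution=0.05, max_lag=4.0):
    """onsets: note onset times in seconds (array); accents: per-note
    accent weights. Returns autocorrelation values at lags up to max_lag."""
    n = int(np.ceil(onsets.max() / resolution)) + 1
    curve = np.zeros(n)                              # accent-strength curve
    np.add.at(curve, (onsets / resolution).astype(int), accents)
    lags = np.arange(1, int(max_lag / resolution))
    return np.array([np.dot(curve[:-k], curve[k:]) for k in lags])
```

Peaks in the returned sequence indicate candidate beat and measure periods, which a downstream classifier can map to a meter category.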
This paper presents a Siamese architecture model with two identical Convolutional Neural Networks (CNNs) to identify code clones: two code fragments are represented as Abstract Syntax Trees (ASTs), CNN-based subnetworks extract feature vectors from the ASTs of pairwise code fragments, and the output layer produces a measure of how similar or dissimilar they are. Experimental results demonstrate that CNN-based feature extraction is effective in detecting code clones at the source code or bytecode level.
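A skeletal version of such a Siamese setup is sketched below: two weight-shared encoders embed the AST representations of a pair of code fragments, and a small head scores their similarity. The AST encoding itself and the head dimensions are assumptions.

```python
import torch
import torch.nn as nn

class SiameseCloneDetector(nn.Module):
    def __init__(self, encoder, dim=128):
        super().__init__()
        self.encoder = encoder            # shared CNN over AST features
        self.head = nn.Sequential(nn.Linear(2 * dim, 64), nn.ReLU(),
                                  nn.Linear(64, 1), nn.Sigmoid())

    def forward(self, ast_a, ast_b):
        # Weight sharing: the same encoder embeds both fragments.
        za, zb = self.encoder(ast_a), self.encoder(ast_b)
        return self.head(torch.cat([za, zb], dim=1))  # clone probability
```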